A Common Parts-of-Speech Tagset Framework for Indian Languages

نویسندگان

  • Baskaran Sankaran
  • Kalika Bali
  • Monojit Choudhury
  • Tanmoy Bhattacharya
  • Pushpak Bhattacharyya
  • Girish Nath Jha
  • S. Rajendran
  • K. Saravanan
  • L. Sobha
  • K. V. Subbarao
چکیده

We present a universal Parts-of-Speech (POS) tagset framework covering most of the Indian languages (ILs) following the hierarchical and decomposable tagset schema. In spite of significant number of speakers, there is no workable POS tagset and tagger for most ILs, which serve as fundamental building blocks for NLP research. Existing IL POS tagsets are often designed for a specific language; the few that have been designed for multiple languages cover only shallow linguistic features ignoring linguistic richness and the idiosyncrasies. The new framework that is proposed here addresses these deficiencies in an efficient and principled manner. We follow a hierarchical schema similar to that of EAGLES and this enables the framework to be flexible enough to capture rich features of a language/ language family, even while capturing the shared linguistic structures in a methodical way. The proposed common framework further facilitates the sharing and reusability of scarce resources in these languages and ensures cross-linguistic compatibility.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Designing a Common POS-Tagset Framework for Indian Languages

Research in Parts-of-Speech (POS) tagset design for European and East Asian languages started with a mere listing of important morphosyntactic features in one language and has matured in later years towards hierarchical tagsets, decomposable tags, common framework for multiple languages (EAGLES) etc. Several tagsets have been developed in these languages along with large amount of annotated dat...

متن کامل

BIS Annotation Standards With Reference to Konkani Language

The Bureau of Indian Standards (BIS) Part Of Speech (POS) tagset has been prepared for the Indian Languages by the POS Tag Standardization Committee of Department of Information Technology (DIT), New Delhi, India. The BIS POS tagset aims to ensure standardization in the POS tagging of all the Indian Languages. It has been used for POS tagging in the Indian Languages Corpora Initiative (ILCI) pr...

متن کامل

Parts Of Speech Tagging for Indian Languages: A Literature Survey

Part of speech (POS) tagging is the process of assigning the part of speech tag or other lexical class marker to each and every word in a sentence. In many Natural Language Processing applications such as word sense disambiguation, information retrieval, information processing, parsing, question answering, and machine translation, POS tagging is considered as the one of the basic necessary tool...

متن کامل

The Linguistics Journal Volume 4 Issue 1 the First Paper on " Part-of-speech Tagging for Grammar Checking of Punjabi " Part-of-speech Tagging for Grammar Checking of Punjabi Noun and Modifier Agreement

Part-of-speech (POS) tagging is one of the major activities performed in a typical natural language processing application. This paper explores part-of-speech tagging for the Punjabi language, a member of the Modern Indo-Aryan family of languages. A tagset for use in grammar checking and other similar applications is proposed. This fine-grained tagset is based entirely on the grammatical catego...

متن کامل

Slovak Morphosyntactic Tagset

Morphological annotation constitutes essential, very useful and very common linguistic information presented in corpora, especially for highly inflectional languages. The morphological tagset used in the Slovak National Corpus has been designed with several goals in mind – the tags are compact and easily human-readable, without sacrificing their informational contents. The tags consist of ASCII...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2008